The Used Car Price Prediction dataset contains 4,009 vehicle listings collected from the automotive marketplace cars.com. Each row represents a unique car and is described by twelve columns covering price and vehicle characteristics. The dataset is taken from Kaggle: https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset
The dataset provides information on:
Brand and model – manufacturer and specific vehicle model
Model year – age of the car, influencing depreciation
Mileage – an indicator of usage and wear
Fuel type – e.g., gasoline, diesel, electric, hybrid
Engine type – performance and efficiency characteristics
Transmission – automatic or manual
Exterior/interior colors – aesthetic properties
Accident history – whether the car has previously been damaged
Clean title – legal/ownership status
Price – listed price of the vehicle
Overall, the dataset offers a structured overview of key features that influence used car valuation. It is well-suited for analytical tasks such as understanding pricing drivers, exploring consumer preferences, and building predictive models for vehicle prices.

# Raw data
We load the original CSV directly from the project data folder using
here() so paths work regardless of the working
directory.
raw_path <- here("data", "raw", "used_cars.csv")
cars_raw <- readr::read_csv(raw_path, show_col_types = FALSE)
Basic structure and summary statistics of the raw dataset:
glimpse(cars_raw)
## Rows: 4,009
## Columns: 12
## $ brand <chr> "Ford", "Hyundai", "Lexus", "INFINITI", "Audi", "Acura", …
## $ model <chr> "Utility Police Interceptor Base", "Palisade SEL", "RX 35…
## $ model_year <dbl> 2013, 2021, 2022, 2015, 2021, 2016, 2017, 2001, 2021, 202…
## $ milage <chr> "51,000 mi.", "34,742 mi.", "22,372 mi.", "88,900 mi.", "…
## $ fuel_type <chr> "E85 Flex Fuel", "Gasoline", "Gasoline", "Hybrid", "Gasol…
## $ engine <chr> "300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capability", "…
## $ transmission <chr> "6-Speed A/T", "8-Speed Automatic", "Automatic", "7-Speed…
## $ ext_col <chr> "Black", "Moonlight Cloud", "Blue", "Black", "Glacier Whi…
## $ int_col <chr> "Black", "Gray", "Black", "Black", "Black", "Ebony.", "Bl…
## $ accident <chr> "At least 1 accident or damage reported", "At least 1 acc…
## $ clean_title <chr> "Yes", "Yes", NA, "Yes", NA, NA, "Yes", "Yes", "Yes", "Ye…
## $ price <chr> "$10,300", "$38,005", "$54,598", "$15,500", "$34,999", "$…
We base the EDA on the engineered dataset
(data/processed/used_cars_features.csv) that keeps cleaned
numeric fields and derived features like age, mileage in thousands, and
accident flags.
features_path <- here("data", "processed", "used_cars_features.csv")
cars <- readr::read_delim(features_path, delim = ";", show_col_types = FALSE)
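As a rough illustration of how raw strings such as "51,000 mi." and "$10,300" can become the numeric fields described above, here is a minimal sketch. It is not the project's actual feature script (which is not shown in this report). The helper `digits_only()` and the value of `reference_year` are assumptions made for illustration only.

```r
# Toy rows based on the glimpse() output above
# (the accident value of the second row is changed for illustration)
raw <- data.frame(
  model_year = c(2013, 2021),
  milage     = c("51,000 mi.", "34,742 mi."),
  price      = c("$10,300", "$38,005"),
  accident   = c("At least 1 accident or damage reported", "None reported"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop everything except digits, then convert to numeric
digits_only <- function(x) as.numeric(gsub("[^0-9]", "", x))

# Assumption: the report does not state which year is used to compute age
reference_year <- 2025

features <- data.frame(
  price_dollar = digits_only(raw$price),
  milage_k     = digits_only(raw$milage) / 1000,
  age          = reference_year - raw$model_year,
  accident_bin = as.integer(raw$accident == "At least 1 accident or damage reported")
)
features$log_price <- log(features$price_dollar)  # natural logarithm

print(features)
```

Stripping all non-digit characters works here because neither column carries decimals; a column with decimal values would need a more careful pattern.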
| variable | median | mean | p25 | p75 | sd | min | max |
|---|---|---|---|---|---|---|---|
| price_dollar | 28000.00 | 36865.68 | 15500.00 | 46999.00 | 36531.16 | 2000.0 | 649999.00 |
| log_price | 10.24 | 10.19 | 9.65 | 10.76 | 0.82 | 7.6 | 13.38 |
| age | 9.00 | 10.32 | 6.00 | 14.00 | 5.87 | 1.0 | 29.00 |
| milage_k | 63.00 | 72.14 | 30.00 | 103.00 | 53.60 | 0.0 | 405.00 |
| horsepower | 310.00 | 331.51 | 248.00 | 400.00 | 120.32 | 76.0 | 1020.00 |

| accident | n | share |
|---|---|---|
| At least 1 accident or damage reported | 871 | 0.28 |
| None reported | 2194 | 0.72 |
Median listing sits around $28k, with the middle 50% between roughly $15.5k and $47k, while the maximum reaches $650k—explaining the heavy right tail. Median age is 9 years (IQR: 6–14), typical mileage is about 63k miles (IQR: 30k–103k), and horsepower clusters around 310 HP (IQR: 248–400). About 28% of cars report an accident or damage, a meaningful factor for pricing.
Raw prices are extremely right-skewed, with most listings below $80k but a long tail of luxury and exotic vehicles. Modeling on this scale would be dominated by a few high-price outliers.
Log transformation produces a more bell-shaped distribution and stabilizes variance, making linear-style models and visual comparisons more reliable.
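A quick arithmetic check of how strongly the log scale compresses the tail, using only the minimum, median, and maximum of price_dollar from the summary table above. The variable names are illustrative.

```r
# Price summary values copied from the summary table (min, median, max)
price <- c(min = 2000, median = 28000, max = 649999)

# Natural logarithm, as used for log_price
log_price <- log(price)
print(round(log_price, 2))

# Spread on the raw scale (ratio of max to min) vs. on the log scale (difference)
raw_ratio <- unname(price["max"] / price["min"])
log_range <- unname(log_price["max"] - log_price["min"])
print(c(raw_ratio = raw_ratio, log_range = log_range))
```

The rounded logs agree with the log_price row of the table, and a max-to-min ratio of roughly 325 on the raw scale becomes a range of under 6 units on the log scale.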
Prices decline with age across fuels. Electric listings start high but show the sharpest early drop; diesel holds comparatively high prices across ages (though the diesel sample is small), and gasoline sits lower overall.
Among the 12 most common brands, Porsche leads on median price, followed by Land Rover and Mercedes-Benz; volume brands (Toyota, Nissan, Jeep) cluster lower with tighter spreads, while some (Chevrolet, Ford) span broader lineups.
Higher mileage correlates with lower prices. We use a loess smoother (not a straight trendline) and cap the x-axis at 250k miles to reduce the influence of extreme outliers; automatics show a steady decline, and the smaller manual subset is noisier but similar in direction.
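The smoothing step can be sketched with base R's `loess()` on simulated data. The data-generating numbers below are made up, and filtering rows at 250k miles before fitting is only one way to cap the axis (a plot-level limit could instead zoom without dropping points); which variant the report's plot uses is not shown.

```r
set.seed(1)
n <- 500

# Simulated listings: log-price falls with mileage, plus noise (made-up parameters)
sim <- data.frame(milage_k = runif(n, 0, 400))
sim$log_price <- 11 - 0.006 * sim$milage_k + rnorm(n, sd = 0.3)

# Keep listings up to 250k miles so extreme mileages do not drive the smoother
capped <- subset(sim, milage_k <= 250)

# Local regression smoother instead of a single straight line
fit <- loess(log_price ~ milage_k, data = capped, span = 0.75)

# Smoothed log-price at a low and a high mileage
pred <- predict(fit, newdata = data.frame(milage_k = c(20, 200)))
print(pred)
```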
Cars with reported accidents trade at a clear discount relative to clean histories, even after log-scaling prices, confirming accident history as an important predictor.
We fit radial-kernel SVM regressors on log_price using
both e1071::svm and kernlab (via
caret::train). Both models use the same 80/20 train-test
split; hyperparameters are tuned by cross-validation and evaluated on
the hold-out test set below.
svm_metrics <- readr::read_csv(
here("report", "models", "svm", "svm_log_price_metrics.csv"),
show_col_types = FALSE
)
svm_best <- readr::read_lines(here("report", "models", "svm", "svm_best_model.txt"))[1]
svm_metrics_wide <- svm_metrics |>
tidyr::pivot_wider(names_from = .metric, values_from = .estimate)
knitr::kable(svm_metrics_wide, digits = 3, caption = "Test metrics for SVM variants (target: log_price)")
| .estimator | model | rmse | mae | rsq |
|---|---|---|---|---|
| standard | e1071_radial | 0.290 | 0.209 | 0.868 |
| standard | kernlab_radial | 0.354 | 0.251 | 0.803 |
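For reference, the three metric columns can be computed by hand as in the sketch below. It assumes yardstick-style definitions (in particular rsq as the squared correlation between truth and prediction), which the `.metric` / `.estimate` column names suggest but the report does not confirm. The function name and toy values are hypothetical.

```r
# Hypothetical helper: RMSE, MAE and squared-correlation R^2 on the log-price scale
regression_metrics <- function(truth, estimate) {
  err <- truth - estimate
  c(
    rmse = sqrt(mean(err^2)),
    mae  = mean(abs(err)),
    rsq  = cor(truth, estimate)^2
  )
}

# Toy log-prices and predictions (made up)
truth    <- c(10.2, 9.8, 11.0, 10.5)
estimate <- c(10.1, 9.9, 11.2, 10.3)

m <- regression_metrics(truth, estimate)
print(round(m, 3))
```

Because the target is log_price, an RMSE of 0.290 is an error in log units; exponentiating gives a rough multiplicative reading of exp(0.29) ≈ 1.34 for a typical miss, though this is only an approximate interpretation.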
The e1071 radial SVM is best by RMSE (~0.290) and R²
(~0.868), outperforming the kernlab variant on this split. SVMs do not
yield straightforward coefficient interpretations; they learn support
vectors and decision functions in a transformed feature space. To
understand feature effects you would rely on downstream tools (e.g.,
partial dependence or SHAP), but within this report we focus on
comparative error metrics and note that the tuned radial kernel captures
non-linear relationships beyond the linear/log models.
Cross-validation setup: both SVMs were tuned with 3-fold cross
validation on the training set (same 80/20 split for both).
e1071::tune() searched a compact grid of cost
and gamma; caret::train(method = "svmRadial")
searched a grid of C and sigma. Final metrics
shown above are from the untouched test set, so cross validation was
only for hyperparameter selection.
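The tuning-then-test workflow for the e1071 variant can be sketched as follows. The data are simulated, the grid values are hypothetical, and only two predictors are used; the project's actual grid and feature set are not shown in this report.

```r
library(e1071)

set.seed(42)
n <- 200

# Simulated stand-in for the feature data (made-up coefficients)
sim <- data.frame(age = runif(n, 1, 29), milage_k = runif(n, 0, 250))
sim$log_price <- 11 - 0.06 * sim$age - 0.006 * sim$milage_k +
  0.1 * sin(sim$age / 3) + rnorm(n, sd = 0.2)

# 80/20 train-test split
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# 3-fold cross-validation over a small, hypothetical grid of cost and gamma
tuned <- tune(
  svm, log_price ~ age + milage_k, data = train,
  kernel = "radial",
  ranges = list(cost = c(1, 10), gamma = c(0.01, 0.1)),
  tunecontrol = tune.control(sampling = "cross", cross = 3)
)
print(tuned$best.parameters)

# The test set is touched exactly once, with the refitted best model
pred <- predict(tuned$best.model, newdata = test)
test_rmse <- sqrt(mean((test$log_price - pred)^2))
print(test_rmse)
```

Keeping the test rows out of `tune()` entirely is what makes the final RMSE an honest estimate rather than a cross-validation score.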
Best model: e1071_radial with RMSE = 0.2898 (lower is better)
We fit two neural-network regressors on log_price: a caret nnet
(with dummy encoding and scaling, tuned by 3-fold cross validation over
size/decay) and a manual neuralnet (shallow hidden layer). Metrics below come
from the test split; cross validation was used only for tuning.
nn_metrics <- readr::read_csv(
here("report", "models", "nn", "nn_log_price_metrics.csv"),
show_col_types = FALSE
)
nn_best <- readr::read_lines(here("report", "models", "nn", "nn_best_model.txt"))[1]
nn_metrics_wide <- nn_metrics |>
tidyr::pivot_wider(names_from = .metric, values_from = .estimate)
knitr::kable(nn_metrics_wide, digits = 3, caption = "Test metrics for NN variants (target: log_price)")
| .estimator | model | rmse | mae | rsq |
|---|---|---|---|---|
| standard | caret_nnet | 0.369 | 0.265 | 0.791 |
| standard | neuralnet_manual | 0.408 | 0.285 | 0.743 |
Best NN model: caret_nnet with RMSE = 0.369 (lower is better). Its edge over
the manual neuralnet is modest (RMSE 0.369 vs. 0.408, R² 0.791 vs. 0.743), so
both nets are in a similar error band, and both trail the e1071 radial SVM
(RMSE 0.290). The manual net slightly reduces bias on the high end (fewer
large positive residuals).
In Section 3 we saw that price is strongly right-skewed and that log(price) has an approximately linear relationship with age. Based on this, we now fit a linear regression model with log(price) as response.
The goal is not primarily to build the most accurate predictor, but to understand which factors drive used-car prices and in which direction.
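We model the natural logarithm of the price instead of the raw price because raw prices are strongly right-skewed, the log transformation stabilizes the variance, and coefficients on the log scale can be read as approximate percentage changes in price.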
We use the feature dataset created in the previous step
(data/processed/used_cars_features.csv) and fit the
following multiple linear regression model on the log-price:
log_price ~ age + milage_k + accident_bin + brand + fuel_type + transmission + ext_col + int_col
Here:
log_price is the natural logarithm of the car price in dollars (the target variable / response).
age is the car age in years.
milage_k is the mileage in thousands of miles.
accident_bin is a binary variable (1 = at least one accident or damage reported, 0 = none reported).
brand, fuel_type, transmission, ext_col and int_col are categorical predictors and are represented in the model by dummy variables with one reference category each.

The coefficients of this model measure how much the expected log-price changes when we increase a numeric predictor by one unit (or switch a dummy variable from 0 to 1), while keeping all other variables fixed.
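How a categorical predictor turns into dummy variables with one reference category can be seen with `model.matrix()` on a toy data frame (values made up). By default R uses the first factor level as the reference, and `relevel()` changes it; which reference levels the project model uses is not stated here.

```r
# Toy data: one numeric and one categorical predictor
toy <- data.frame(
  age = c(3, 10, 7),
  fuel_type = factor(c("Gasoline", "Electric", "Diesel"))
)

# Default coding: the first level (alphabetical) is the omitted reference
X_default <- model.matrix(~ age + fuel_type, data = toy)
print(levels(toy$fuel_type)[1])
print(colnames(X_default))

# Choosing a different reference category explicitly
toy$fuel_type <- relevel(toy$fuel_type, ref = "Gasoline")
X_releveled <- model.matrix(~ age + fuel_type, data = toy)
print(colnames(X_releveled))
```

Each dummy coefficient is then a contrast against the omitted level, which is why statements such as "higher than the reference brand" depend on which level was dropped.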
We implement and fit the model by sourcing the script
src/04_model_linear.R, which loads the feature data
(used_cars_features.csv), fits the model with stats::lm(...), and
prepares lm_linear_data.
# Fit the baseline linear regression model and prepare lm_linear_data
source(here::here("src", "04_model_linear.R"))
# Show a compact summary of the model
broom::tidy(lm_linear) |>
dplyr::slice(1:10) # show only first 10 coefficients
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 11.4 0.115 98.9 0
## 2 age -0.0578 0.00188 -30.8 5.45e-181
## 3 milage_k -0.00599 0.000198 -30.3 3.05e-176
## 4 accident_bin -0.0801 0.0174 -4.61 4.28e- 6
## 5 brandAlfa -0.0619 0.146 -0.424 6.71e- 1
## 6 brandAudi 0.225 0.0728 3.10 1.98e- 3
## 7 brandBentley 1.27 0.111 11.4 1.32e- 29
## 8 brandBMW 0.281 0.0688 4.08 4.61e- 5
## 9 brandBuick -0.172 0.119 -1.44 1.49e- 1
## 10 brandCadillac 0.276 0.0788 3.50 4.77e- 4
Because the model is fitted on the log(price) scale, each coefficient can be read approximately as a percentage change in price when we increase that variable by one unit (or switch a dummy variable from 0 to 1), while keeping all other variables fixed. Roughly, a coefficient of −0.06 means “about −6 %”, a coefficient of +0.20 means “about +22 %”, and so on.
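The exact conversion behind this rule of thumb is 100 * (exp(beta) - 1). The sketch below applies it to coefficient values quoted in this section (rounded as they appear in the table and text); the vector and variable names are illustrative.

```r
# Coefficients as quoted in this section (rounded)
beta <- c(
  age                = -0.0578,
  milage_k           = -0.00599,
  accident_bin       = -0.0801,
  brandBMW           =  0.281,
  brandFerrari       =  1.82,
  fuel_typeElectric  = -0.79,
  transmissionManual =  0.20
)

# Exact percentage change in expected price for a one-unit increase / dummy switch
pct_change <- 100 * (exp(beta) - 1)
print(round(pct_change, 1))
```

For small coefficients the result is close to 100 * beta, while for large ones (such as the Ferrari dummy) the naive reading is far off, so the exponentiated form should be used.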
Below we interpret a few selected coefficients from the model.
Age (coefficient ≈ −0.058)
Holding all other variables constant, increasing the age of a car by one
year reduces the expected price by about 6 %
(exp(−0.058) ≈ 0.94).
→ Older cars are substantially cheaper, as expected.
Mileage in thousands (coefficient ≈
−0.006)
An additional 1,000 miles reduces the expected price by roughly
0.6 %.
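→ Per unit, the effect of mileage is smaller than the effect of age, but
10,000 additional miles already correspond to roughly −6 %, comparable to
one additional year of age.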
Accident history (accident_bin, coefficient
≈ −0.080)
accident_bin = 1 indicates that at least one accident or
damage has been reported. Compared to a car with no accident history
(accident_bin = 0), the expected price is lower by about
8 %.
→ Cars with an accident history sell for clearly lower prices.
Brand examples
The brand coefficients show how each brand differs from the (omitted)
reference brand, holding age, mileage, accident history and all other
variables constant.
brandBMW (coefficient ≈ 0.28): price is about
32 % higher than for the reference brand (exp(0.28) ≈ 1.32).
→ BMW cars are noticeably more expensive, even after controlling for
other factors.
brandFerrari (coefficient ≈ 1.82): price is about
6.2 times that of the reference brand (roughly +517 %,
since exp(1.82) ≈ 6.17).
→ Ferrari appears as an extreme premium brand in this dataset.
Fuel type example (fuel_typeElectric,
coefficient ≈ −0.79)
For purely electric cars, the coefficient corresponds to a price that is
roughly 55 % lower than for the reference fuel type,
given the same age, mileage, brand etc.
→ In this sample, electric cars are priced clearly below comparable
vehicles with the reference fuel type (possibly due to different model
mix, range concerns, or incentives on combustion cars).
Transmission (transmissionManual,
coefficient ≈ 0.20)
Manual transmission has a coefficient of about 0.20, which translates to
prices that are roughly 22 % higher than for the
reference transmission type (exp(0.20) ≈ 1.22).
→ Cars with manual transmission are on average more expensive in this
sample.
Overall, the signs and magnitudes of the coefficients are plausible: older and higher-mileage cars and cars with accidents are cheaper, while premium brands (e.g. BMW, Ferrari) and certain configurations (e.g. manual transmission) command higher prices. With an \(R^2\) of about 0.75 on the log-price scale, the model explains a large share of the variation in used car prices, even though there is still substantial unexplained variability.
To visualise the effect of age in the linear model, we plot log-price against age and add a fitted linear trend line.
The plot shows the expected negative relationship: older cars tend to have a lower log-price. In our model, the age coefficient is about −0.058, which means that, holding all other variables constant, one additional year of age reduces the expected price by roughly 6 % (because exp(−0.058) ≈ 0.94).
To check whether the linear model assumptions are roughly satisfied, we plot the residuals against the fitted (predicted) log-prices. Ideally, the points are scattered randomly around zero without a clear pattern.
In our plot, the residuals are roughly centred around zero with no
strong curvature. There is some increase in spread for higher predicted
prices, but overall the linear model assumptions appear acceptable.
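A residuals-versus-fitted check of this kind can be sketched as follows on simulated data (all numbers made up; the project's own plot is produced elsewhere). The split-half comparison of residual spread is an informal addition, not a formal test for heteroscedasticity.

```r
set.seed(7)
n <- 300

# Simulated stand-in for the feature data
sim <- data.frame(age = runif(n, 1, 29), milage_k = runif(n, 0, 250))
sim$log_price <- 11.4 - 0.058 * sim$age - 0.006 * sim$milage_k + rnorm(n, sd = 0.4)

fit <- lm(log_price ~ age + milage_k, data = sim)
diag_df <- data.frame(fitted = fitted(fit), resid = resid(fit))

# Residuals vs. fitted: points should scatter around the dashed zero line
plot(diag_df$fitted, diag_df$resid,
     xlab = "Fitted log-price", ylab = "Residual")
abline(h = 0, lty = 2)

# Informal check for changing spread: residual SD in the lower vs. upper half
upper_half <- diag_df$fitted > median(diag_df$fitted)
spread <- c(
  lower = sd(diag_df$resid[!upper_half]),
  upper = sd(diag_df$resid[upper_half])
)
print(round(spread, 3))
```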
Overall, the linear regression on log(price) provides a simple and interpretable summary of the main price drivers in this dataset. Age and mileage have the expected negative impact on prices, while an accident history leads to a price discount of roughly 8 %. Premium brands such as BMW and especially Ferrari command large price premia, even after controlling for age, mileage and fuel type. The model explains about 75 % of the variance in log-prices, which is quite high for real-world data, but the residual plot also shows that there is still substantial unexplained variability. For a client, this model could already be used as a rough pricing guideline, but more advanced models (e.g. with interactions or nonlinear effects) might capture the remaining structure in the data even better.
A Poisson GLM is typically used when the response variable is
a count (non-negative integers). Because this dataset does not
contain a natural event count, we use a pragmatic proxy: mileage
in thousands (milage_k), converted to an integer
“count-like” variable (milage_k_count).
This section mainly demonstrates the Poisson GLM workflow and
interpretation.
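A minimal sketch of the conversion, assuming rounding to the nearest thousand (the project script may truncate or round differently). The second part uses the mean and standard deviation of milage_k from the summary table to show that the marginal variance is far above the mean, an early hint that the Poisson assumption will be strained.

```r
# Toy mileage values in thousands of miles (made up)
milage_k <- c(0.4, 22.372, 51.0, 88.9)

# Assumption: nearest-integer rounding produces the count-like proxy
milage_k_count <- as.integer(round(milage_k))
print(milage_k_count)

# Marginal variance-to-mean ratio from the summary table (mean 72.14, sd 53.60);
# a Poisson variable would have a ratio of about 1
variance_to_mean <- 53.60^2 / 72.14
print(round(variance_to_mean, 1))
```

This ratio ignores the predictors, so it is only indicative; the model-based dispersion is the relevant diagnostic.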
We model the expected mileage count (in thousands) using a Poisson GLM with a log link:
milage_k_count ~ age + accident_bin + brand + fuel_type + transmission + ext_col + int_col
In a Poisson GLM with log link:
Response: milage_k_count (non-negative integer proxy)
Link: log( E[milage_k_count] ) = linear predictor

So coefficients are interpreted multiplicatively:
exp(beta) is the factor change in the expected count for a
one-unit increase (or dummy switch), holding other variables fixed.
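The multiplicative reading follows directly from the log link. Writing \(y\) for milage_k_count and \(x_1, \dots, x_p\) for the predictors (notation introduced here for illustration), raising \(x_j\) by one unit while holding the others fixed gives:

\[
\log \mathbb{E}[y \mid x] = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
\quad\Longrightarrow\quad
\frac{\mathbb{E}[y \mid x_j + 1]}{\mathbb{E}[y \mid x_j]}
= \frac{\exp\bigl(\beta_0 + \dots + \beta_j (x_j + 1) + \dots\bigr)}{\exp\bigl(\beta_0 + \dots + \beta_j x_j + \dots\bigr)}
= e^{\beta_j}
\]

For the age coefficient of 0.0676 reported in this section, \(e^{0.0676} \approx 1.07\), i.e. about 7 % more expected mileage per additional year.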
We fit the Poisson GLM by sourcing
src/06_model_glm_poisson.R, which:
1) loads the feature data,
2) creates milage_k_count,
3) fits a Poisson GLM with log link, and
4) adds predictions/residuals back to the data.
source(here::here("src", "06_model_glm_poisson.R"))
# compact coefficient table incl. significance stars
poisson_tbl <- broom::tidy(glm_poisson) |>
dplyr::mutate(
signif = dplyr::case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ ""
)
)
poisson_tbl |>
dplyr::select(term, estimate, std.error, statistic, p.value, signif) |>
dplyr::slice(1:12)
## # A tibble: 12 × 6
## term estimate std.error statistic p.value signif
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 3.79 0.0291 130. 0 ***
## 2 age 0.0676 0.000408 166. 0 ***
## 3 accident_bin 0.215 0.00453 47.5 0 ***
## 4 brandAlfa -0.564 0.0600 -9.39 5.92e- 21 ***
## 5 brandAudi -0.102 0.0191 -5.31 1.07e- 7 ***
## 6 brandBentley -0.941 0.0394 -23.9 6.01e-126 ***
## 7 brandBMW -0.161 0.0179 -8.99 2.48e- 19 ***
## 8 brandBuick -0.140 0.0303 -4.63 3.63e- 6 ***
## 9 brandCadillac -0.0474 0.0205 -2.31 2.08e- 2 *
## 10 brandChevrolet -0.233 0.0182 -12.9 7.88e- 38 ***
## 11 brandChrysler -0.209 0.0285 -7.32 2.45e- 13 ***
## 12 brandDodge -0.0818 0.0205 -3.99 6.72e- 5 ***
## AIC pseudo_R2_deviance dispersion_pearson
## 1 79885.43 0.4922187 20.4108
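If the dispersion is clearly above 1, the Poisson variance assumption (variance = mean) is violated (overdispersion). Here the Pearson dispersion is about 20.4, so overdispersion is strong: the Poisson standard errors are too small and the p-values too optimistic. In a real use case you would consider a quasi-Poisson or negative binomial model. We keep Poisson here for comparability and interpretation practice.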
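A minimal sketch of how such fit statistics and a quasi-Poisson refit could be computed, on simulated overdispersed counts. The definitions used (Pearson chi-square over residual degrees of freedom, and 1 − deviance / null deviance) are the common ones and are assumed to match the `dispersion_pearson` and `pseudo_R2_deviance` columns; the project script itself is not shown.

```r
set.seed(123)
n <- 500

# Simulated overdispersed counts: negative binomial draws have variance > mean
sim <- data.frame(age = runif(n, 1, 29))
mu <- exp(3.8 + 0.07 * sim$age)
sim$count <- rnbinom(n, mu = mu, size = 2)

fit_pois <- glm(count ~ age, family = poisson(link = "log"), data = sim)

# Pearson dispersion: values well above 1 indicate overdispersion
pearson_dispersion <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)

# Deviance-based pseudo R^2
pseudo_r2_deviance <- 1 - fit_pois$deviance / fit_pois$null.deviance

# Quasi-Poisson: same coefficients, standard errors scaled by sqrt(dispersion)
fit_quasi <- glm(count ~ age, family = quasipoisson(link = "log"), data = sim)
se_ratio <- summary(fit_quasi)$coefficients[, "Std. Error"] /
  summary(fit_pois)$coefficients[, "Std. Error"]

print(c(dispersion = pearson_dispersion, pseudo_r2 = pseudo_r2_deviance))
print(se_ratio)
print(sqrt(pearson_dispersion))
```

Applied to the reported dispersion of about 20.4, the same scaling would inflate the standard errors by a factor of roughly 4.5, while leaving the coefficient estimates unchanged.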
In a Poisson GLM with log link:
exp(beta) > 1 → expected count increases
exp(beta) < 1 → expected count decreases
Percent change: (exp(beta) - 1) * 100
## # A tibble: 4 × 5
## term estimate rate_ratio pct_change p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 age 0.0676 1.07 7 0
## 2 accident_bin 0.215 1.24 24 0
## 3 fuel_typeElectric -1.01 0.365 -63.5 1.08e-190
## 4 transmissionManual -0.318 0.728 -27.2 0
The Poisson GLM demonstrates how to model a count-like response with
a log link and interpret effects as multiplicative changes via
exp(beta). While milage_k_count is not a true
event count, this workflow is useful to practice GLM estimation,
interpretation, and basic diagnostics. In a real client setting with
genuine count outcomes, overdispersion should be checked carefully and a
quasi-Poisson or negative binomial model may be more appropriate.